An Algorithm to Self-Extract Secondary Keywords and Their Combinations Based on Abstracts Collected using Primary Keywords from Online Digital Libraries

Authors

  • Natarajan Meghanathan
  • Nataliya Kostyuk
  • Raphael D. Isokpehi
  • Hari Cohly
Abstract

The high-level contribution of this paper is the development and implementation of an algorithm to self-extract secondary keywords and their combinations (combo words) from abstracts collected using standard primary keywords for research areas from reputable online digital libraries such as IEEE Xplore and PubMed Central. Given a collection of N abstracts, we arbitrarily select M abstracts (M << N; M/N as low as 0.15) and parse each of the M abstracts word by word. When a word appears for the first time, we ask the user to classify it into an Accept-List or a non-Accept-List. The effectiveness of the training approach is evaluated by measuring the percentage of words for which the user is queried as the algorithm parses each of the M abstracts. We observed that as M grows larger, this percentage drops drastically. After the Accept-List is built from the M abstracts, we parse all N abstracts word by word and count how often each word in the Accept-List appears. We also construct a Combo-Accept-List comprising all possible combinations of the single keywords in the Accept-List, parse all N abstracts two successive words (a combo word) at a time, and count how often each combo word in the Combo-Accept-List appears in these N abstracts.
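The procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`tokenize`, `train`, `count_keywords`, `count_combos`), the tokenization rule, and the `classify` callback that stands in for the interactive user are all assumptions. The sketch also assumes that the query percentage is computed per training abstract and that a combo word is an ordered pair of two distinct Accept-List keywords; the abstract does not spell out either detail.

```python
import re
from collections import Counter
from itertools import permutations


def tokenize(text):
    """Lowercase the text and split it into words (hyphenated words kept whole)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def train(training_abstracts, classify):
    """Build the Accept-List from the M training abstracts.

    classify(word) -> bool is a hypothetical stand-in for the user query:
    it is called only the first time a word is seen.
    Returns (accept, reject, query_pct), where query_pct[i] is the percentage
    of words in training abstract i that triggered a query.
    """
    accept, reject = set(), set()
    query_pct = []
    for abstract in training_abstracts:
        words = tokenize(abstract)
        queried = 0
        for word in words:
            if word in accept or word in reject:
                continue  # already classified, no query needed
            queried += 1
            (accept if classify(word) else reject).add(word)
        query_pct.append(100.0 * queried / len(words) if words else 0.0)
    return accept, reject, query_pct


def count_keywords(abstracts, accept):
    """Count occurrences of each Accept-List word across all N abstracts."""
    counts = Counter()
    for abstract in abstracts:
        counts.update(w for w in tokenize(abstract) if w in accept)
    return counts


def count_combos(abstracts, accept):
    """Count two-successive-word combos that appear in the Combo-Accept-List."""
    # Assumption: the Combo-Accept-List holds ordered pairs of distinct keywords.
    combo_accept = set(permutations(accept, 2))
    counts = Counter()
    for abstract in abstracts:
        words = tokenize(abstract)
        counts.update(p for p in zip(words, words[1:]) if p in combo_accept)
    return counts


# Toy data, invented for illustration only.
abstracts = [
    "Routing protocols for mobile ad hoc networks",
    "Energy efficient routing in sensor networks",
    "Mobile sensor networks and routing protocols",
]
stop_words = {"for", "in", "and"}
accept, reject, pct = train(abstracts[:2], lambda w: w not in stop_words)
print(pct)
print(count_keywords(abstracts, accept).most_common(3))
print(count_combos(abstracts, accept).most_common(2))
```

In this toy run every word of the first training abstract triggers a query, while the second triggers fewer because "routing" and "networks" are already classified. That is the same trend the abstract reports as M grows. Words that never occur in the training subset (here "and") are simply not counted in the second pass.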


Similar resources

A survey of Compliance of Persian Abstracts of Research Articles in Medical Journals of Universities of Medical Sciences (Type 1) with the Vancouver Guideline and ISO 214 Standard

Background & Aims: An abstract is the best source of information within a content. The structured abstract is the best representation of the main source. Writing an appropriate abstract requires adherence to standards of abstracting. The aim of this study was to investigate the compliance of Persian abstracts of research papers in medical journals of universities of medical sciences (type 1) wit...

An Approach to Clustering

Free access to full-text scientific papers in major digital libraries and other web repositories is limited to only their abstracts consisting of no more than several dozens of words. Current keyword-based techniques allow for clustering such type of short texts only when the data set is multi-category, e.g., some documents are devoted to sport, others to medicine, others to politics, etc. Howe...

Degree of Consistency between Keywords Extracted from Abstracts and Indexers' Descriptors in the "Iran's Theses Abstracts" Database

Purpose: This research studies the consistency between keywords extracted from abstracts of theses by experts in the related fields and descriptors provided by the indexers in the database of “Iran’s theses abstracts”. Methodology: This research is an applied study based on content analysis. A checklist consisting of 32 criteria was used. In addition, we consulted the experts ...

An Optimized Online Secondary Path Modeling Method for Single-Channel Feedback ANC Systems

This paper proposes a new method for online secondary path modeling in feedback active noise control (ANC) systems. In practical cases, the secondary path is usually time-varying. For these cases, online modeling of secondary path is required to ensure convergence of the system. In literature the secondary path estimation is usually performed offline, prior to online modeling, where in the prop...

A Context-based Technique Using Tag-tree for an Effective Retrieval from a Digital Literature Collection

The growth of information in online digital libraries creates an increasing need for techniques to retrieve it. In a digital library, findability (finding the information the user requires) is a more demanding task than usability. The major issues in findability are (a) topic diffusion: results of a traditional keyword-based search often lead to multiple topic areas, some of which ...


Journal:
  • CoRR

Volume: abs/1006.1184

Publication date: 2010